University of Oslo
November 26, 2024
| Session | Content | Readings |
|---|---|---|
| Morning (9.00–12.00) | – ML and supervised learning | Prince, chs. 1–2 |
| | – Python and PyTorch | |
| | – Demo 1: 1D linear regression | |
| | – Shallow neural networks | Prince, ch. 3 |
| | – Deep neural networks | Prince, ch. 4 |
| Afternoon (13.00–16.00) | – Loss functions | Prince, ch. 5 |
| | – Fitting models | Prince, chs. 6–7 |
| | – Demo 2: Gradient descent | |
| | – Measuring performance | Prince, ch. 8 |
| | – Demo 3: Fully connected deep NNs | |
| Session | Content | Readings |
|---|---|---|
| Morning (9.00–12.00) | – Convolutional neural networks (CNNs) | Prince, ch. 10 |
| | – Demo 4: CNNs | |
| Afternoon (13.00–16.00) | – Transformers | Prince, ch. 12 |
| | – Demo 5: Transformers | |
You will need for the course:
The textbook for the course is:
Other useful deep learning books are:
Course demos use Jupyter notebooks hosted on Google Colab and GitHub.
| Demo | Content | Notebook | Google Colab | GitHub |
|---|---|---|---|---|
| Demo 1 | 1D linear regression w/ scikit-learn | Notebook 1 | Colab nb-1 | GitHub nb-1 |
| Demo 2 | 1D linear regression w/ gradient descent | | | |
| Demo 3 | Fully connected deep neural networks | Notebook 2 | Colab nb-2 | GitHub nb-2 |
| Demo 4 | Convolutional neural networks (CNNs) | Notebook 3 | Colab nb-3 | GitHub nb-3 |
| Demo 5 | Transformers | Notebook 4 | Colab nb-4 | GitHub nb-4 |
Artificial Intelligence (AI)
Artificial intelligence (AI) is the field concerned with building systems that simulate intelligent behavior.
Machine Learning (ML)
Machine learning (ML) is the study of algorithms that can learn from experience. Learning means that as an algorithm accumulates experience, its performance improves.
ML is a subfield of AI.
Deep Learning (DL)
Deep learning (DL) uses deep neural networks, which are a type of ML model, to learn from experience.
DL is a subfield of ML.
Supervised Learning
Supervised learning models learn a mapping from one or more inputs to one or more outputs. They learn this mapping from labeled training data, which are examples of input/output pairs.
Supervised learning deals with regression and classification problems.
Regression Problems
In regression problems the goal is to map inputs to continuous outputs.
In a univariate regression problem the output is a real-valued scalar. In a multivariate regression problem the output is a real-valued vector.
Classification Problems
In classification problems the goal is to map inputs to categorical outputs.
In a binary classification problem the output has two categories. In a multiclass classification problem the output has more than two categories.
Unsupervised Learning
Unsupervised learning models aim to describe patterns or structure in data. They learn such patterns or structure from unlabeled training data, which are examples of inputs (w/o corresponding output labels).
Generative Unsupervised Learning
Generative unsupervised models learn to create new data examples that are statistically indistinguishable from, but not identical to, the training examples.
Generative unsupervised models are a subset of unsupervised learning models. They have recently gained prominence (e.g., in generating synthetic survey data).1
Reinforcement Learning
Reinforcement learning is characterized by an agent (learner) that takes actions in an environment, moving between states and receiving rewards.
The goal is for the agent to learn actions that change the state in a way leading to high rewards (e.g., chess).
The learner does not receive any data, but has to explore the environment to gather data as it goes.
The performance of deep learning models is impressive for many learning tasks (e.g., image classification, speech recognition, language translation, etc.).
Should we therefore use deep learning on every learning problem?
We expect deep learning models to perform well when the training set is very large and can support the fitting of high-dimensional nonlinear models.
On smaller data sets, simpler models (such as linear regression, lasso, etc.) may perform similarly well, yet they are easier to fit and interpret.1
Therefore, it can make sense to fit deep neural networks as well as simpler models and then compare them based on their performance.
In supervised learning, we build a model \(\mathbf{f}[\cdot]\) that maps input \(\mathbf{x}\) to output \(\mathbf{y}\), \[ \mathbf{y} = \mathbf{f}[\mathbf{x}, \boldsymbol{\phi}]. \]
The model \(\mathbf{f}[\cdot]\) is a function, which describes a family of possible relationships between input \(\mathbf{x}\) and output \(\mathbf{y}\).
The model also contains parameters \(\boldsymbol{\phi}\). The parameter values define the particular relationship between input and output.
We typically learn the parameters by feeding an algorithm with a training data set \(\mathcal{S}\) that contains \(N\) pairs of input/output examples, \[ \mathcal{S} = \{(x_{i}, y_{i})\}_{i = 1}^{N}. \]
The goal of the algorithm is to choose parameters that map inputs to outputs as closely as possible.
To quantify the degree of mismatch in the mapping, we define a loss function \(L[\boldsymbol{\phi}, \mathcal{S}]\) or, for short, \(L[\boldsymbol{\phi}]\).
When training the model, the algorithm chooses parameters \(\boldsymbol{\hat{\phi}}\) that minimize the loss function, \[ \boldsymbol{\hat{\phi}} = \underset{\boldsymbol{\phi}}{\mathrm{argmin}}\big[L[\boldsymbol{\phi}]\big]. \]
After training the model, we assess its performance on a separate test data set to see how well it generalizes to new examples.
If the performance is adequate, we can deploy our trained model \(\mathbf{f}[\mathbf{x}, \boldsymbol{\hat{\phi}}]\).
The process of finding the parameters that minimize the loss function is called model fitting, training, or learning.
The basic method for model fitting is the gradient descent algorithm:
After training the model, we want to assess how well it generalizes to new data. We do this by computing the loss of the trained model \(f[x, \boldsymbol{\hat{\phi}}]\) on a separate test data set.
The trained model might not generalize well because:
Demo 1: 1D Linear Regression
The Python programming language is widely used in data science due to its readability and vast ecosystem of libraries.
Python is a dynamically typed language, so the data type of a variable is determined at runtime and does not need to be declared beforehand.
Core Python libraries for data science include:
Training deep learning models typically requires optimization of a large number of parameters.
We need a framework that can handle large-scale optimization efficiently by utilizing graphics processing units (GPUs). PyTorch is one such framework.1
PyTorch is one of the most popular frameworks for deep learning and has a tight integration with Python.
In PyTorch, the fundamental data structure is the tensor, which is a multi-dimensional array:
Tensors can run on central processing units (CPUs) or graphics processing units (GPUs).
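As a minimal sketch (values chosen only for illustration), tensors can be created from Python lists, combined with elementwise and matrix operations, and moved to a GPU when one is available:

```python
import torch

# Create a 2x3 tensor from a nested list.
x = torch.tensor([[1.0, 2.0, 3.0],
                  [4.0, 5.0, 6.0]])

y = x * 2.0      # elementwise multiplication
z = x @ x.T      # 2x2 matrix product

# Move the tensor to a GPU if one is available; otherwise stay on the CPU.
device = "cuda" if torch.cuda.is_available() else "cpu"
x_dev = x.to(device)
```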
PyTorch uses dynamic computation graphs, which are constructed at runtime and are therefore more flexible and easier to debug than static graphs.1
We previously used a 1D linear regression model to describe the relationship between input \(x\) and output \(y\).
However, this model can only describe the input/output relationship as a line.
Shallow neural networks are a more flexible model.
They describe piecewise linear functions and can approximate arbitrarily complex relationships between inputs and outputs.
Shallow neural networks are functions \(\mathbf{f}[\mathbf{x}, \boldsymbol{\phi}]\) with parameters \(\boldsymbol{\phi}\) that map inputs \(\mathbf{x}\) to outputs \(\mathbf{y}\).
Consider a 1D shallow neural network \(f[x, \boldsymbol{\phi}]\) that maps a scalar input \(x\) to a scalar output \(y\) and has 10 parameters \(\boldsymbol{\phi}\), \[ \begin{split} y &= f[x, \boldsymbol{\phi}] \\ &= \phi_{0} + \phi_{1}a[\theta_{10} + \theta_{11}x] + \phi_{2}a[\theta_{20} + \theta_{21}x] + \phi_{3}a[\theta_{30} + \theta_{31}x]. \end{split} \qquad(1)\]
This model has three parts:
How do we define the activation function \(a[\cdot]\)?
The most common choice for \(a[\cdot]\) is the rectified linear unit (ReLU), \[ a[z] = \mathrm{ReLU}[z] = \begin{cases} 0 & \text{if } z < 0, \\ z & \text{if } z \geq 0. \end{cases} \]
This returns the input \(z\) when \(z\) is positive and 0 when it is negative (it clips negative values of \(z\) to 0).
Having defined \(a[\cdot]\), we can proceed by specifying a least squares loss function \(L[\boldsymbol{\phi}]\) and train the model on a training data set \(\mathcal{S} = \{(x_{i}, y_{i})\}_{i = 1}^{N}\) by searching for the values \(\boldsymbol{\hat{\phi}}\) that minimize \(L[\boldsymbol{\phi}]\).
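A minimal NumPy sketch of this model, with ReLU as the activation function (the parameter values below are arbitrary and chosen only for illustration):

```python
import numpy as np

def relu(z):
    # ReLU clips negative values of z to 0.
    return np.maximum(0.0, z)

def shallow_net(x, phi, theta):
    """1D shallow network of Equation 1 with three hidden units.

    phi:   [phi_0, phi_1, phi_2, phi_3]
    theta: 3x2 array whose rows are [theta_d0, theta_d1]
    """
    h = relu(theta[:, 0] + theta[:, 1] * x)   # hidden-unit activations
    return phi[0] + phi[1:] @ h

# Arbitrary parameter values, for illustration only.
phi = np.array([-0.2, 0.5, -1.0, 0.8])
theta = np.array([[0.3, -1.0],
                  [-1.0, 2.0],
                  [-0.5, 0.7]])

xs = np.linspace(-2.0, 2.0, 9)
ys = [shallow_net(x, phi, theta) for x in xs]
```

Evaluating the model over a grid of inputs, as above, traces out the continuous piecewise linear function the parameters define.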
Equation 1 represents a family of continuous piecewise linear functions with up to four linear regions.
To see this, we split the function into two parts:
Figure 6 (next slide) illustrates the flow of computation for the model described by Equation 1, using the ReLU activation function.
Each linear region in panel j) corresponds to a different activation pattern in the hidden units.
When a unit is clipped, we say it is inactive; and when it is not clipped, we say it is active.
We visualize the above network as in Figure 7.
Figure 8 shows some neural network terminology.
With ReLU activation functions, the output of a network with \(D\) hidden units is a piecewise linear function with at most \(D + 1\) linear regions.
As we increase the number of hidden units \(D\), the model can approximate more complex functions.
Universal Approximation Theorem
With enough hidden units, a shallow neural network can describe any continuous function on a compact subset of \(\mathbb{R}^{D_{i}}\) (the input space) to arbitrary precision.
The above discussion generalizes to the case of a shallow neural network with an arbitrary number of inputs, hidden units, and outputs.
In this general case, a shallow neural network \(\mathbf{f}[\mathbf{x}, \boldsymbol{\phi}]\) maps a multi-dimensional input \(\mathbf{x} \in \mathbb{R}^{D_{i}}\) to a multi-dimensional output \(\mathbf{y} \in \mathbb{R}^{D_{o}}\) using hidden units \(\mathbf{h} \in \mathbb{R}^{D}\).
Each hidden unit \(h_{d}\), for \(d = 1, \dotsc, D\), is computed as: \[ h_{d} = a\Big[\theta_{d0} + \sum_{i = 1}^{D_{i}}\theta_{di}x_{i}\Big], \] where \(a[\cdot]\) is a non-linear activation function (e.g., ReLU).
Note: The activation function \(a[\cdot]\) allows the model to describe non-linear relations between \(\mathbf{x}\) and \(\mathbf{y}\) and, therefore, has to be non-linear (such as ReLU).
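All hidden units can be computed at once as a matrix-vector product followed by the activation function; a small sketch (dimensions and parameter values are arbitrary):

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def hidden_layer(x, theta0, Theta):
    """Compute all hidden units h_d = a[theta_d0 + sum_i theta_di x_i] at once.

    x:      input vector of length D_i
    theta0: bias vector of length D
    Theta:  D x D_i weight matrix
    """
    return relu(theta0 + Theta @ x)

rng = np.random.default_rng(0)
D_i, D = 4, 6                       # arbitrary dimensions for illustration
x = rng.normal(size=D_i)
h = hidden_layer(x, rng.normal(size=D), rng.normal(size=(D, D_i)))
```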
Shallow neural networks have one hidden layer.
Deep neural networks have more than one hidden layer.
Both shallow and deep neural networks describe piecewise linear mappings from input to output (with ReLU activation functions).
With enough hidden units, shallow neural networks can describe arbitrarily complex functions in high dimensions (universal approximation theorem).
Deep neural networks are more practical.
To better understand the behavior of deep neural networks, we first consider connecting two shallow ones (with three hidden units each).
The first shallow network takes input \(x\) and returns output \(y\). The second shallow network takes \(y\) as input and returns output \(y^{\prime}\), see Figure 11, panel a). \[ \begin{split} h_{1} &= a[\theta_{10} + \theta_{11}x] \\ \text{Network 1:} \qquad h_{2} &= a[\theta_{20} + \theta_{21}x] \qquad y = \phi_{0} + \phi_{1}h_{1} + \phi_{2}h_{2} + \phi_{3}h_{3} \\ h_{3} &= a[\theta_{30} + \theta_{31}x]. \end{split} \] \[ \begin{split} h_{1}^{\prime} &= a[\theta_{10}^{\prime} + \theta_{11}^{\prime}y] \\ \text{Network 2:} \qquad h_{2}^{\prime} &= a[\theta_{20}^{\prime} + \theta_{21}^{\prime}y] \qquad y^{\prime} = \phi_{0}^{\prime} + \phi_{1}^{\prime}h_{1}^{\prime} + \phi_{2}^{\prime}h_{2}^{\prime} + \phi_{3}^{\prime}h_{3}^{\prime} \\ h_{3}^{\prime} &= a[\theta_{30}^{\prime} + \theta_{31}^{\prime}y]. \end{split} \]
Figure 11, panel b).
Figure 11, panel c).
Figure 11, panel d).
\[ \begin{split} h_{1} &= a[\theta_{10} + \theta_{11}x] \\ \text{Network 1:} \qquad h_{2} &= a[\theta_{20} + \theta_{21}x] \qquad \color{red}{y = \phi_{0} + \phi_{1}h_{1} + \phi_{2}h_{2} + \phi_{3}h_{3}} \\ h_{3} &= a[\theta_{30} + \theta_{31}x]. \end{split} \] \[ \begin{split} h_{1}^{\prime} &= a[\theta_{10}^{\prime} + \theta_{11}^{\prime}\color{red}{y}\color{black}{]} \\ \text{Network 2:} \qquad h_{2}^{\prime} &= a[\theta_{20}^{\prime} + \theta_{21}^{\prime}\color{red}{y}\color{black}{]} \qquad y^{\prime} = \phi_{0}^{\prime} + \phi_{1}^{\prime}h_{1}^{\prime} + \phi_{2}^{\prime}h_{2}^{\prime} + \phi_{3}^{\prime}h_{3}^{\prime} \\ h_{3}^{\prime} &= a[\theta_{30}^{\prime} + \theta_{31}^{\prime}\color{red}{y}\color{black}{]}. \end{split} \]
\[ \begin{split} h_{1}^{\prime} &= a[\theta_{10}^{\prime} + \theta_{11}^{\prime}\color{red}{y}\color{black}{]} = a[\theta_{10}^{\prime} + \theta_{11}^{\prime}\color{red}{\phi_{0}} \color{black}{+} \theta_{11}^{\prime}\color{red}{\phi_{1}h_{1}} \color{black}{+} \theta_{11}^{\prime}\color{red}{\phi_{2}h_{2}} \color{black}{+} \theta_{11}^{\prime}\color{red}{\phi_{3}h_{3}}\color{black}{]} \\ h_{2}^{\prime} &= a[\theta_{20}^{\prime} + \theta_{21}^{\prime}\color{red}{y}\color{black}{]} = a[\theta_{20}^{\prime} + \theta_{21}^{\prime}\color{red}{\phi_{0}} \color{black}{+} \theta_{21}^{\prime}\color{red}{\phi_{1}h_{1}} \color{black}{+} \theta_{21}^{\prime}\color{red}{\phi_{2}h_{2}} \color{black}{+} \theta_{21}^{\prime}\color{red}{\phi_{3}h_{3}}\color{black}{]} \\ h_{3}^{\prime} &= a[\theta_{30}^{\prime} + \theta_{31}^{\prime}\color{red}{y}\color{black}{]} = a[\theta_{30}^{\prime} + \theta_{31}^{\prime}\color{red}{\phi_{0}} \color{black}{+} \theta_{31}^{\prime}\color{red}{\phi_{1}h_{1}} \color{black}{+} \theta_{31}^{\prime}\color{red}{\phi_{2}h_{2}} \color{black}{+} \theta_{31}^{\prime}\color{red}{\phi_{3}h_{3}}\color{black}{]}. \end{split} \]
\[ \begin{split} &\text{Hidden layer 1:} &\text{Hidden layer 2:} \\ &h_{1} = a[\theta_{10} + \theta_{11}x] \qquad &h_{1}^{\prime} = a[\psi_{10} + \psi_{11}h_{1} + \psi_{12}h_{2} + \psi_{13}h_{3}] \\ &h_{2} = a[\theta_{20} + \theta_{21}x] &h_{2}^{\prime} = a[\psi_{20} + \psi_{21}h_{1} + \psi_{22}h_{2} + \psi_{23}h_{3}] \\ &h_{3} = a[\theta_{30} + \theta_{31}x] &h_{3}^{\prime} = a[\psi_{30} + \psi_{31}h_{1} + \psi_{32}h_{2} + \psi_{33}h_{3}], \end{split} \qquad(2)\]
\[ \text{Output:} \quad y^{\prime} = \phi_{0}^{\prime} + \phi_{1}^{\prime}h_{1}^{\prime} + \phi_{2}^{\prime}h_{2}^{\prime} + \phi_{3}^{\prime}h_{3}^{\prime}. \]
\[ \begin{split} \psi_{10} &= \theta_{10}^{\prime} + \theta_{11}^{\prime}\phi_{0} \\ \psi_{11} &= \theta_{11}^{\prime}\phi_{1} \\ \psi_{12} &= \theta_{11}^{\prime}\phi_{2} \\ \psi_{13} &= \theta_{11}^{\prime}\phi_{3} \\ &\vdots \end{split} \qquad(3)\]
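This re-parameterization can be checked numerically. The sketch below (with randomly drawn parameters) evaluates the two composed shallow networks and the equivalent single network with two hidden layers:

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

rng = np.random.default_rng(1)
theta  = rng.normal(size=(3, 2))   # rows [theta_d0, theta_d1] for network 1
thetap = rng.normal(size=(3, 2))   # rows [theta'_d0, theta'_d1] for network 2
phi    = rng.normal(size=4)        # output weights of network 1
phip   = rng.normal(size=4)        # output weights of network 2

def composed(x):
    # Network 1 feeding network 2, exactly as written on the slide.
    h = relu(theta[:, 0] + theta[:, 1] * x)
    y = phi[0] + phi[1:] @ h
    hp = relu(thetap[:, 0] + thetap[:, 1] * y)
    return phip[0] + phip[1:] @ hp

def deep(x):
    # Same computation written as one network with two hidden layers,
    # using the re-parameterization psi of Equation 3.
    psi0 = thetap[:, 0] + thetap[:, 1] * phi[0]   # psi_d0
    Psi = np.outer(thetap[:, 1], phi[1:])         # psi_dj = theta'_d1 * phi_j
    h = relu(theta[:, 0] + theta[:, 1] * x)
    hp = relu(psi0 + Psi @ h)
    return phip[0] + phip[1:] @ hp

# The two formulations agree for any input.
assert np.allclose(composed(0.7), deep(0.7))
```

The agreement is exact because \(y\) enters each second-layer unit linearly, so substituting it inside the activation does not change the computation.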
No. of hidden layers, \(K\), is called the depth of the network.
No. of hidden units in each layer, \(D_{1}, D_{2}, \dotsc, D_{K}\), is called the width of the network. Total no. of hidden units is a measure of the capacity of the network.
These parameters are hyperparameters \(\rightarrow\) they are chosen before training the network.
We can retrain the network with different hyperparameters \(\rightarrow\) hyperparameter optimization / hyperparameter search.
Both deep and shallow neural networks can approximate any continuous function, but the former can often do so (much) more efficiently (i.e., with fewer parameters). This is called the depth efficiency of neural networks.
For some structured inputs (e.g., images), we want to process local input regions in parallel and then gradually integrate information from increasingly larger regions. Such local-to-global processing is difficult to achieve without multiple layers.
It is easier to train (moderately) deep neural networks than it is to train shallow ones.
Deep neural networks seem to generalize better to new data than shallow neural networks.
We so far discussed:
Models:
Up next:
Loss functions:
When we train a model, we seek the model parameters \(\boldsymbol{\phi}\) that produce the best possible mapping from inputs to outputs.
To formalize “best possible” mapping, we need:
During training, we seek parameter values \(\boldsymbol{\phi}\) that minimize the loss function (and hence map the training inputs to the outputs as closely as possible).
We need a framework that allows us to build loss functions for different prediction tasks, such as
A commonly used framework is maximum likelihood, which gives us a principled approach to building loss functions.
So far, we assumed that a model \(\mathbf{f}[\mathbf{x}, \boldsymbol{\phi}]\) directly predicts output \(\boldsymbol{y}\) from input \(\boldsymbol{x}\).
The maximum likelihood framework takes a different perspective:
Steps for Constructing a Loss Function
| Prediction task | Output domain | Distribution |
|---|---|---|
| Univariate regression | \(y \in \mathbb{R}\) | Univariate normal |
| Multivariate regression | \(\mathbf{y} \in \mathbb{R}^{D_{o}}\) | Multivariate normal |
| Binary classification | \(y \in \{0, 1\}\) | Bernoulli |
| Multiclass classification | \(y \in \{1, 2, \dotsc, K\}\) | Categorical |
We so far discussed:
Models:
Loss functions:
Up next:
Model training:
To fit a model \(\mathbf{f}[\mathbf{x}, \boldsymbol{\phi}]\), we need a training data set of input/output pairs, \(\mathcal{S} = \{(\mathbf{x}_{i}, \mathbf{y}_{i})\}_{i = 1}^{N}\).
We seek parameters \(\boldsymbol{\phi}\) for the model \(\mathbf{f}[\mathbf{x}, \boldsymbol{\phi}]\) that map the inputs \(\mathbf{x}_{i}\) to the outputs \(\mathbf{y}_{i}\) as well as possible.
We therefore define a loss function \(L[\boldsymbol{\phi}]\) that quantifies the mismatch in this mapping.
We also need an optimization algorithm to find the parameters \(\boldsymbol{\hat{\phi}}\) that minimize the loss,
\[ \boldsymbol{\hat{\phi}} = \underset{\boldsymbol{\phi}}{\mathrm{argmin}}\big[L[\boldsymbol{\phi}]\big]. \]
The gradient descent (GD) algorithm starts by choosing initial parameters \(\boldsymbol{\phi} = [\phi_{0}, \phi_{1}, \dotsc, \phi_{M}]^{T}\).
It then iterates two steps (until it converges):
GD has converged when the gradient reaches zero and the parameters no longer change. In practice, the algorithm is typically stopped when the gradient magnitude falls below a predefined threshold.
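A minimal sketch of GD on a toy 1D linear regression with a least squares loss (the data and settings are illustrative):

```python
import numpy as np

# Toy training data for y ≈ phi_0 + phi_1 * x (illustrative values).
rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=50)
y = 0.5 + 1.5 * x + 0.1 * rng.normal(size=50)

phi = np.zeros(2)    # initial parameters [phi_0, phi_1]
alpha = 0.1          # learning rate

for step in range(1000):
    resid = (phi[0] + phi[1] * x) - y
    # Gradient of the mean squared error w.r.t. phi_0 and phi_1.
    grad = 2.0 * np.array([resid.mean(), (resid * x).mean()])
    if np.linalg.norm(grad) < 1e-6:   # stop once the gradient magnitude is tiny
        break
    phi -= alpha * grad               # step downhill
```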
Demo 2: Gradient Descent
For convex loss functions, there is a single (global) minimum (Figure 16 panel b).
Loss functions for non-linear models (such as shallow and deep NNs) are typically non-convex, i.e., they have local minima (Figure 16 panel a) and saddle points (Figure 16 panel c) in addition to the global minimum.
Stochastic gradient descent (SGD) tries to overcome the problems of GD by adding noise to the gradient at each step.
Panel a). If GD is initialized at a “right” point on the loss function (e.g., points 1 and 3), it will move to the global minimum. If it is initialized at a “wrong” point (e.g., point 2), it will move to a local minimum.
Panel b). By adding noise to the optimization process, SGD is able to escape from “wrong” valleys (e.g., point 2) and reach the global minimum.
SGD introduces randomness by computing the gradient based on only a random subset (batch) of the training data at each iteration, \[ \boldsymbol{\phi}_{t + 1} \leftarrow \boldsymbol{\phi}_{t} - \alpha \sum_{i \in \mathcal{B}_{t}}\frac{\partial \ell_{i}[\boldsymbol{\phi}_{t}]}{\partial \boldsymbol{\phi}}, \] where \(\mathcal{B}_{t}\) is the set of indices of input/output pairs in the batch at the \(t\)th iteration and \(1 \leq |\mathcal{B}_{t}| \leq N\).
SGD draws batches without replacement from the training data set until it has used all the data.
After SGD has iterated through the training data set, it repeats the process. A single pass through the training data set is called an epoch.
SGD often uses a learning rate schedule: The learning rate \(\alpha\) starts at a high value and is decreased over epochs.
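These ingredients (batches drawn without replacement, epochs, and a decaying learning rate) can be sketched on toy linear-regression data (all values illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=200)
y = 1.0 + 2.0 * x + 0.1 * rng.normal(size=200)

phi = np.zeros(2)
batch_size = 20

for epoch in range(50):
    alpha = 0.1 / (1.0 + 0.1 * epoch)     # decaying learning-rate schedule
    order = rng.permutation(len(x))       # batches drawn without replacement
    for start in range(0, len(x), batch_size):
        idx = order[start:start + batch_size]
        resid = (phi[0] + phi[1] * x[idx]) - y[idx]
        # Mean (rather than sum) over the batch; this just rescales alpha.
        grad = 2.0 * np.array([resid.mean(), (resid * x[idx]).mean()])
        phi -= alpha * grad
```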
A modification to SGD is to add a momentum term, \[ \begin{split} \mathbf{m}_{t + 1} &\leftarrow \beta \cdot \mathbf{m}_{t} + (1 - \beta) \sum_{i \in \mathcal{B}_{t}}\frac{\partial \ell_{i}[\boldsymbol{\phi}_{t}]}{\partial \boldsymbol{\phi}} \\ \boldsymbol{\phi}_{t + 1} &\leftarrow \boldsymbol{\phi}_{t} - \alpha \cdot \mathbf{m}_{t + 1}, \end{split} \] where:
We therefore update the parameters with a weighted combination of the gradient computed from the current batch and the previous change.
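A sketch of a single momentum update, following the equations above (the default coefficient values are illustrative):

```python
import numpy as np

def momentum_step(phi, m, grad, alpha=0.05, beta=0.9):
    """One SGD-with-momentum update following the equations above.

    m holds an exponentially weighted average of recent batch gradients;
    grad is the gradient computed from the current batch.
    """
    m = beta * m + (1.0 - beta) * grad
    phi = phi - alpha * m
    return phi, m

phi, m = np.array([1.0]), np.zeros(1)
phi, m = momentum_step(phi, m, grad=np.array([2.0]))
```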
GD with a fixed learning rate \(\alpha\) makes large parameter updates in directions where the gradient magnitude is large and small updates where it is small. On poorly conditioned loss surfaces, this can cause oscillation in steep directions and slow progress in shallow ones.
This problem can be overcome by normalizing the gradients when updating the parameters, \[ \begin{split} \mathbf{m}_{t + 1} &\leftarrow \frac{\partial L[\boldsymbol{\phi}_{t}]}{\partial \boldsymbol{\phi}} \\ \mathbf{v}_{t + 1} &\leftarrow \bigg(\frac{\partial L[\boldsymbol{\phi}_{t}]}{\partial \boldsymbol{\phi}}\bigg)^{2} \\ \boldsymbol{\phi}_{t + 1} &\leftarrow \boldsymbol{\phi}_{t} - \alpha \frac{\mathbf{m}_{t + 1}}{\sqrt{\mathbf{v}_{t + 1}} + \delta}, \end{split} \] where \(\delta\) is a small constant preventing division by zero when the gradient magnitude is 0.
Adaptive moment estimation (Adam) uses normalized gradients but adds momentum to the gradient and the squared gradient to overcome the non-convergence problem: \[ \begin{split} \mathbf{m}_{t + 1} &\leftarrow \beta \cdot \mathbf{m}_{t} + (1 - \beta)\sum_{i \in \mathcal{B}_{t}}\frac{\partial \ell_{i}[\boldsymbol{\phi}_{t}]}{\partial \boldsymbol{\phi}} \\ \mathbf{v}_{t + 1} &\leftarrow \gamma \cdot \mathbf{v}_{t} + (1 - \gamma)\bigg(\sum_{i \in \mathcal{B}_{t}}\frac{\partial \ell_{i}[\boldsymbol{\phi}_{t}]}{\partial \boldsymbol{\phi}}\bigg)^{2}, \end{split} \] where \(\beta, \gamma \in [0, 1)\) are the momentum coefficients. Typical values for the momentum coefficients are \(\beta = 0.9\) and \(\gamma = 0.99\) (Bishop and Bishop 2024, 224).
The statistics \(\mathbf{m}_{t}\) and \(\mathbf{v}_{t}\) are initialized to 0. This results in the problem that \(\mathbf{m}_{t + 1}\) and \(\mathbf{v}_{t + 1}\) are biased towards zero during the early steps in the iterative procedure.
To counteract this bias, Adam modifies \(\mathbf{m}_{t + 1}\) and \(\mathbf{v}_{t + 1}\) before updating the parameters: \[ \begin{split} \mathbf{\tilde{m}}_{t + 1} \leftarrow \frac{\mathbf{m}_{t + 1}}{1 - \beta^{t + 1}}&, \quad \mathbf{\tilde{v}}_{t + 1} \leftarrow \frac{\mathbf{v}_{t + 1}}{1 - \gamma^{t + 1}} \\ \boldsymbol{\phi}_{t + 1} \leftarrow \boldsymbol{\phi}_{t} - \alpha & \frac{\mathbf{\tilde{m}}_{t + 1}}{\sqrt{\mathbf{\tilde{v}}_{t + 1}} + \delta}. \end{split} \]
Since \(\beta, \gamma \in [0, 1)\), the denominators \((1 - \beta^{t + 1})\) and \((1 - \gamma^{t + 1})\) approach 1 as \(t\) increases, and the bias corrections have a diminishing effect.
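A sketch of one Adam update with the bias correction included (hyperparameter defaults follow the typical values above; the example gradient is arbitrary):

```python
import numpy as np

def adam_step(phi, m, v, grad, t, alpha=0.001, beta=0.9, gamma=0.99, delta=1e-8):
    """One Adam update with bias correction (t = 0, 1, 2, ... is the iteration)."""
    m = beta * m + (1.0 - beta) * grad
    v = gamma * v + (1.0 - gamma) * grad**2
    m_tilde = m / (1.0 - beta**(t + 1))     # bias-corrected statistics
    v_tilde = v / (1.0 - gamma**(t + 1))
    phi = phi - alpha * m_tilde / (np.sqrt(v_tilde) + delta)
    return phi, m, v

# At t = 0 the bias-corrected update is roughly alpha * sign(gradient),
# independent of the gradient's magnitude.
phi, m, v = np.array([1.0]), np.zeros(1), np.zeros(1)
phi, m, v = adam_step(phi, m, v, grad=np.array([5.0]), t=0)
```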
SGD and Adam are the most widely used optimization algorithms in deep learning.1
We previously discussed hyperparameters governing the network architecture:
We now discussed further hyperparameters, which govern the behavior of the learning algorithm:
Training models with different sets of hyperparameters and choosing the best ones is called hyperparameter search.
We so far discussed:
Models:
Loss functions:
Model training:
Up next:
Model training for neural networks:
Iterative optimization algorithms such as GD are general-purpose methods for finding the minimum of a function.
In the context of neural networks, we use these methods to find parameters that minimize the loss function.
Two issues require consideration when we use these methods to train neural networks.
We have a loss function \(L[\boldsymbol{\phi}]\) that sums the individual losses over the training examples, \[ L[\boldsymbol{\phi}] = \sum_{i = 1}^{N}\ell_{i}, \qquad \ell_{i} = \ell\big[f[\mathbf{x}_{i}, \boldsymbol{\phi}], y_{i}\big]. \]
We use SGD as the optimization algorithm for training the network, \[ \boldsymbol{\phi}_{t + 1} \leftarrow \boldsymbol{\phi}_{t} - \alpha \sum_{i \in \mathcal{B}_{t}}\frac{\partial \ell_{i}[\boldsymbol{\phi}_{t}]}{\partial \boldsymbol{\phi}}, \] where \(\alpha\) is the learning rate and \(\mathcal{B}_{t}\) contains the batch indices at iteration \(t\).
SGD requires us to calculate the derivatives \[ \frac{\partial \ell_{i}}{\partial \boldsymbol{\beta}_{k}} \quad \text{and} \quad \frac{\partial \ell_{i}}{\partial \boldsymbol{\Omega}_{k}}, \] for the parameters \(\{\boldsymbol{\beta}_{k}, \boldsymbol{\Omega}_{k}\}\) at every layer \(k = 0, 1, \dotsc, K\) and for each index \(i\) in batch \(\mathcal{B}_{t}\).
The derivatives \[ \frac{\partial \ell_{i}}{\partial \boldsymbol{\beta}_{k}} \quad \text{and} \quad \frac{\partial \ell_{i}}{\partial \boldsymbol{\Omega}_{k}} \] tell us how the loss changes when we make a small change to the parameters \(\boldsymbol{\beta}_{k}\) and \(\boldsymbol{\Omega}_{k}\).
We need to compute the derivatives of the loss \(\ell_{i}\) w.r.t. each of the biases \(\boldsymbol{\beta}_{k}\) and weights \(\boldsymbol{\Omega}_{k}\).
The backpropagation algorithm computes these derivatives via two steps:
Observation: Each weight (element of \(\boldsymbol{\Omega}_{k}\)) multiplies the activation of a source hidden unit and adds the result to a destination hidden unit in the next layer.
Therefore, any small change to the weight is amplified by the activation of the source hidden unit.
To compute the derivatives of the weights, we therefore need to calculate and store the activations of all hidden units. This is done in the forward pass.
Observation: A small change in a bias or weight causes a ripple effect of changes through the subsequent network. E.g.,
The same quantities are required when we consider other parameters in the same or earlier layers. We can therefore calculate them once and reuse them.
The backward pass does this by first computing derivatives at the end of the network and reusing previous calculations when moving backward.1
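The two passes can be seen directly in PyTorch, where autograd records the forward pass and then computes all parameter derivatives in the backward pass (the tiny network below is illustrative):

```python
import torch

torch.manual_seed(0)

# A tiny fully connected network: 1 input, 3 hidden units, 1 output.
x = torch.tensor([0.5])
y = torch.tensor([1.0])
W1 = torch.randn(3, 1, requires_grad=True)
b1 = torch.zeros(3, requires_grad=True)
W2 = torch.randn(1, 3, requires_grad=True)
b2 = torch.zeros(1, requires_grad=True)

# Forward pass: hidden activations are computed (and stored by autograd).
h = torch.relu(W1 @ x + b1)
y_hat = W2 @ h + b2
loss = (y_hat - y).pow(2).sum()     # least squares loss for one example

# Backward pass: derivatives flow from the loss back through the layers.
loss.backward()
# W1.grad, b1.grad, W2.grad, b2.grad now hold d(loss)/d(parameter).
```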
We so far discussed:
Models:
Loss functions:
Model training:
Model training for neural networks:
Up next:
Model performance:
A neural network with sufficient capacity (no. of hidden units) will often perform perfectly on the training data.
However, this does not mean that it will generalize well to new test data, i.e., have a low test error.
There are three possible sources of test error:
Noise
Noise is inherent randomness/uncertainty in the data generation process.
Bias
Bias is the systematic deviation of our model from the mean of the true function we are approximating.
Variance
Variance is the uncertainty in our fitted model due to the particular training data set we sample.
The three sources noise, bias, and variance characterize the (expected) test error of any prediction model.
For regression models (where the noise is additive with variance \(\sigma^{2}\)) and a least squares loss, they combine additively: \[ E_{\mathcal{S}}\Big[E_{y}\big[L[x]\big]\Big] = \underbrace{E_{\mathcal{S}}\Big[\big(f[x, \boldsymbol{\phi}[\mathcal{S}]] - f_{\mu}[x]\big)^{2}\Big]}_{\text{Variance}} + \underbrace{\big(f_{\mu}[x] - \mu[x]\big)^{2}}_{\text{Bias}} + \underbrace{\sigma^{2}}_{\text{Noise}}, \] where \(f_{\mu}[x] = E_{\mathcal{S}}\big[f[x, \boldsymbol{\phi}[\mathcal{S}]]\big]\) is the mean of the fitted model over training data sets and \(\mu[x]\) is the mean of the true data generation process.
Panel a) Noise. Data generation is noisy. Even if the model exactly fits the true underlying function (black line), the noise in the test data (black points) means that some error will remain.
Panel b) Bias. Even with the best possible parameters, the three-region model (cyan line) cannot exactly fit the true function (black line).
Panel c) Variance. In practice, we have limited noisy training data (orange points). When we fit the model, we don’t recover the best possible function from panel b) but a slightly different function (cyan line) that reflects idiosyncrasies of the training data.
Noise
This source of error is irreducible and represents the upper limit on expected model performance.
Bias
Can be reduced by making the model more flexible (e.g., increasing network capacity by adding more hidden units).
Variance
Can be reduced by increasing the quantity of training data or, for a fixed-size training data set, making the model less flexible.
Bias-Variance Trade-Off
For a fixed-size training data set, increasing model flexibility may decrease the bias, but at the cost of increasing the variance. \(\rightarrow\) Hence, there is an optimal model flexibility.
As the model capacity increases, the bias decreases, but the variance increases.
There is a sweet spot where the combination of bias and variance reaches its minimum.
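A toy simulation illustrates the pattern (the setup is illustrative: polynomial models of increasing degree, standing in for models of increasing capacity, fit to noisy data). Training error only decreases with capacity, while test error typically falls and then rises again:

```python
import numpy as np

rng = np.random.default_rng(0)

def true_fn(x):
    return np.sin(2 * np.pi * x)

# One noisy training set and a large noisy test set.
x_tr = rng.uniform(0, 1, size=30)
y_tr = true_fn(x_tr) + 0.2 * rng.normal(size=30)
x_te = rng.uniform(0, 1, size=1000)
y_te = true_fn(x_te) + 0.2 * rng.normal(size=1000)

train_err, test_err = [], []
for d in range(1, 10):                          # increasing model capacity
    coef = np.polyfit(x_tr, y_tr, deg=d)        # least squares polynomial fit
    train_err.append(np.mean((np.polyval(coef, x_tr) - y_tr) ** 2))
    test_err.append(np.mean((np.polyval(coef, x_te) - y_te) ** 2))
```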
For some data sets, loss functions, and models, the test error shows a pattern different from the one predicted by the bias-variance trade-off:
We saw previously that the test performance of a model changes with its capacity.
In practice, how do we find the model capacity that leads to optimal test performance?
The capacity of a deep neural network is governed by its hyperparameters (no. of hidden layers, no. of hidden units per hidden layer, etc.).
We take an empirical approach to finding the best set of hyperparameters (hyperparameter search).
Demo 3: Fully Connected Deep Neural Networks
We so far discussed:
Models:
Loss functions:
Model training in general & Model training for neural networks:
Model performance:
Up next:
Convolutional Networks & Transformers:
We so far discussed fully connected neural networks with a single path from input to output.
Convolutional neural networks (CNNs) are a more specialized type of network with sparser connections and shared weights.
CNNs are mainly used for processing image data.
Images have three properties that call for a more specialized network architecture:
CNNs use fewer parameters than fully connected networks.
A function \(\mathbf{f}[\mathbf{x}]\) is invariant to a transformation \(\mathbf{t}[\cdot]\) if: \[ \mathbf{f}\big[\mathbf{t}[\mathbf{x}]\big] = \mathbf{f}[\mathbf{x}]. \]
A function \(\mathbf{f}[\mathbf{x}]\) is equivariant to a transformation \(\mathbf{t}[\cdot]\) if: \[ \mathbf{f}\big[\mathbf{t}[\mathbf{x}]\big] = \mathbf{t}\big[\mathbf{f}[\mathbf{x}]\big]. \]
Networks for image classification should be invariant to geometric transformations of an image.
Networks for image segmentation should be equivariant to geometric transformations of an image.
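A minimal sketch of the two properties on a 1D signal, using a circular shift as the transformation \(\mathbf{t}[\cdot]\) (the functions are toy examples, not networks):

```python
import numpy as np

def shift(x, k=1):
    # Transformation t[.]: circularly shift the signal.
    return np.roll(x, k)

def f_invariant(x):
    # The sum of the values ignores position entirely.
    return x.sum()

def f_equivariant(x):
    # Elementwise squaring acts at each position independently,
    # so shifting the input just shifts the output.
    return x ** 2

x = np.array([1.0, 2.0, 3.0, 4.0])
assert f_invariant(shift(x)) == f_invariant(x)                        # invariance
assert np.allclose(f_equivariant(shift(x)), shift(f_equivariant(x)))  # equivariance
```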
Figure 29, Image classification, panels a)-b). The goal is to categorize both images as “mountain” regardless of the horizontal shift. The network prediction should be invariant to the transformation.
Figure 29, Image segmentation, panels c)-f). The goal is to associate a label with each pixel. When the input is transformed, the output (color overlay) should be equivariant (be transformed in the same way).
Convolutional networks consist of a series of convolutional layers:
Convolutional layers perform a convolution operation, which transforms an input vector \(\mathbf{x}\) into an output vector \(\mathbf{z}\) so that each \(z_{i}\) is a weighted sum of the nearby inputs. E.g., \[ z_{i} = \omega_{1}x_{i - 1} + \omega_{2}x_{i} + \omega_{3}x_{i + 1}, \qquad(4)\] where \(\boldsymbol{\omega} = [\omega_{1}, \omega_{2}, \omega_{3}]^{T}\) is called the convolution kernel and the region over which inputs are combined is called the kernel size.
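A sketch of Equation 4 on a short signal, using zero padding at the boundaries (one common way to handle the missing neighbors; the kernel and input values are arbitrary):

```python
import numpy as np

def conv1d(x, omega):
    """Equation 4: z_i is a weighted sum of the three nearest inputs.

    Zero padding supplies the missing neighbors at the two boundaries.
    """
    x_pad = np.concatenate([[0.0], x, [0.0]])
    return np.array([omega @ x_pad[i:i + 3] for i in range(len(x))])

omega = np.array([-1.0, 2.0, -1.0])        # arbitrary kernel values
x = np.array([1.0, 2.0, 4.0, 7.0, 11.0])
z = conv1d(x, omega)                       # array([ 0., -1., -1., -1., 15.])
```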
Each convolution layer employs parameter sharing (all hidden units in a layer share the same weights), so local image patches are processed similarly at every position in the image.
Equation 4 shows that each output \(z_{i}\) is a weighted sum of the three nearest inputs \(x_{i - 1}\), \(x_{i}\), and \(x_{i + 1}\).
The first output does not have an input at position \(i - 1\), while the last output does not have an input at position \(i + 1\).
There are two approaches to address this problem:
Stride. For a stride of \(k\), the kernel is evaluated at every \(k\)th position in the input vector (Figure 31, panels a)-b): first output \(z_{1}\) has kernel centered at \(x_{1}\), second output \(z_{2}\) has kernel centered at \(x_{3}\), etc.).
Kernel Size. Kernel size can be increased (Figure 31, panel c).
Dilation. Intersperse kernel values with zeros (no. of zeros is called dilation rate) (Figure 31, panel d).
A convolutional layer computes its output by convolving the input, adding a bias \(\beta\), and passing each result through an activation function \(a[\cdot]\).
Hidden unit \(h_{i}\) in a convolutional layer: \[ \begin{split} h_{i} &= a[\beta + \omega_{1}x_{i - 1} + \omega_{2}x_{i} + \omega_{3}x_{i + 1}] \\ &= a \Bigg[\beta + \sum_{j = 1}^{3}\omega_{j}x_{i + j - 2}\Bigg]. \end{split} \]
Hidden unit \(h_{i}\) in a fully-connected layer: \[ h_{i} = a \Bigg[\beta_{i} + \sum_{j = 1}^{D}\omega_{ij}x_{j}\Bigg]. \]
A convolutional layer is a special case of a fully-connected layer (with most parameters set to zero and others constrained to be identical).
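This correspondence can be verified directly: construct the weight matrix of the equivalent fully-connected layer, which is zero everywhere except for the same three kernel weights shifted along each row (a sketch with arbitrary values):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
omega = np.array([-0.5, 1.0, 0.5])
beta = 0.1
relu = lambda v: np.maximum(v, 0.0)

# Convolutional layer: shared weights, local connections (interior positions only)
h_conv = relu(beta + np.array([omega @ x[i - 1:i + 2] for i in range(1, len(x) - 1)]))

# Equivalent fully-connected layer: mostly-zero weight matrix with the same
# three kernel values repeated along each row, one position to the right each time
W = np.zeros((4, 6))
for i in range(4):
    W[i, i:i + 3] = omega
h_fc = relu(beta + W @ x)

assert np.allclose(h_conv, h_fc)   # identical hidden units
```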
If we only apply a single convolution, information will be lost (since we average nearby inputs and the ReLU activation function clips results that are below zero).
Therefore, several convolutions are typically used to produce sets of hidden units (channels).
Figure 32, panel a): a first kernel computes a weighted sum of the three nearest inputs, adds a bias, and passes the results through the activation function to produce hidden units \(h_{1}\) to \(h_{6}\) (first channel).
Figure 32, panel b): a second kernel computes a different weighted sum, adds a different bias, and passes the results through the activation function to produce hidden units \(h_{7}\) to \(h_{12}\) (second channel).
Figure 32, panel c): if a layer has \(C_{i}\) channels, the hidden units in the next layer are computed as a weighted sum over these channels.
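The same idea in NumPy: two kernels applied to the same input produce two channels, and a hidden unit in the next layer is a weighted sum over both channels (all weights here are arbitrary, and only interior positions are computed):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
relu = lambda v: np.maximum(v, 0.0)

def conv_channel(signal, kernel, bias):
    """One channel: convolve, add bias, apply ReLU (interior positions only)."""
    return relu(bias + np.array([kernel @ signal[i - 1:i + 2]
                                 for i in range(1, len(signal) - 1)]))

h1 = conv_channel(x, np.array([-0.5, 1.0, 0.5]), 0.1)  # first channel
h2 = conv_channel(x, np.array([1.0, 0.0, -1.0]), 0.2)  # second channel

# A hidden unit in the next layer: a weighted sum over BOTH channels
w1 = np.array([0.3, -0.2, 0.1])   # weights applied to channel 1
w2 = np.array([0.5, 0.4, -0.1])   # weights applied to channel 2
i = 1
h_next = relu(w1 @ h1[i - 1:i + 2] + w2 @ h2[i - 1:i + 2])
print(h_next)
```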
The receptive field of a hidden unit in the network is the region of the original input that feeds into it.
Convolutional networks comprise a sequence of convolutional layers.
Figure 33, panel a): The hidden units in \(\mathbf{H}_{1}\) are weighted sums of the three nearest inputs. The receptive field in \(\mathbf{H}_{1}\) is of size 3.
Figure 33, panel b): The hidden units in \(\mathbf{H}_{2}\) are weighted sums of the three nearest positions in \(\mathbf{H}_{1}\), which are themselves weighted sums of the three nearest inputs. The receptive field in \(\mathbf{H}_{2}\) is of size 5.
Figure 33, panels c)-d): By the time we add \(\mathbf{H}_{4}\), the receptive field covers the entire input.
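For size-3, stride-1 kernels, each additional layer grows the receptive field by two positions, which can be tallied directly (a minimal sketch):

```python
def receptive_field(num_layers, kernel_size=3):
    """Receptive field of one unit after `num_layers` stride-1 convolutional layers."""
    rf = 1
    for _ in range(num_layers):
        rf += kernel_size - 1     # each layer extends the field by (k - 1)
    return rf

for layer in range(1, 5):
    print(f"H_{layer}: receptive field {receptive_field(layer)}")
# H_1: 3, H_2: 5, H_3: 7, H_4: 9
```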
So far, we have discussed convolutional networks for processing 1D data.
However, convolutional networks are typically applied to 2D image data.
With 2D inputs, a convolutional layer with a 3 \(\times\) 3 kernel \(\boldsymbol{\Omega} \in \mathbb{R}^{3 \times 3}\) computes a hidden unit \(h_{ij}\) in a hidden layer as: \[ h_{ij} = a \Bigg[\beta + \sum_{m = 1}^{3}\sum_{n = 1}^{3}\omega_{mn}x_{i + m - 2, j + n - 2}\Bigg], \] where \(\omega_{mn}\) are the entries of the convolutional kernel and \(x_{ij}\) are elements of the 2D input matrix.
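A direct NumPy transcription of this formula, evaluated at the valid (interior) positions of a 4 \(\times\) 4 input; a 3 \(\times\) 3 averaging kernel is used here purely for illustration:

```python
import numpy as np

relu = lambda v: np.maximum(v, 0.0)

x = np.arange(16.0).reshape(4, 4)     # 4x4 input image
Omega = np.full((3, 3), 1.0 / 9.0)    # 3x3 averaging kernel (illustrative choice)
beta = 0.0

H = np.zeros((2, 2))                  # valid positions of a 3x3 kernel on a 4x4 input
for i in range(2):
    for j in range(2):
        # h_ij = a[beta + sum_{m,n} omega_mn * x_{i+m-2, j+n-2}]
        H[i, j] = relu(beta + np.sum(Omega * x[i:i + 3, j:j + 3]))
print(H)                              # each entry averages one 3x3 patch
```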
In the case of convolutional networks for 1D inputs, we can use stride and dilation to scale down (downsample) the representation and increase the receptive field at each layer.
In the case of convolutional networks for 2D inputs, there are three main methods for downsampling (e.g., scaling down both dimensions of a 4 \(\times\) 4 representation by a factor of 2):
Subsampling. Retain every other position (equivalent to using a stride of two).
Max Pooling. Retain the maximum value of each 2 \(\times\) 2 block.
Mean Pooling. Retain the average of each 2 \(\times\) 2 block (also called average pooling).
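Subsampling and pooling can be sketched in a few lines of NumPy (PyTorch provides the pooling operations as `nn.MaxPool2d` and `nn.AvgPool2d`):

```python
import numpy as np

x = np.arange(16.0).reshape(4, 4)    # a 4x4 representation

sub  = x[::2, ::2]                                 # subsample every other position
mx   = x.reshape(2, 2, 2, 2).max(axis=(1, 3))      # max of each 2x2 window
mean = x.reshape(2, 2, 2, 2).mean(axis=(1, 3))     # mean of each 2x2 window
print(sub, mx, mean, sep="\n")                     # three 2x2 downsampled outputs
```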
We sometimes want to scale representations back up (upsampling), such as in cases where the output is also an image.
Three common methods for upsampling are (e.g., for scaling up both dimensions of a 2 \(\times\) 2 representation by a factor of 2):
Duplication. Copy each value into all positions of the corresponding 2 \(\times\) 2 block.
Max Unpooling. Return each value to the position it was taken from in a paired max-pooling layer.
Bilinear Interpolation. Interpolate between the known values to fill in the new positions.
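Duplication can be sketched in one line of NumPy (PyTorch's `nn.Upsample` with `mode='nearest'` is the library equivalent; max unpooling additionally requires the indices remembered by a paired pooling layer):

```python
import numpy as np

x = np.array([[1.0, 2.0],
              [3.0, 4.0]])   # 2x2 representation to be scaled up to 4x4

# Duplication (nearest-neighbor upsampling): copy each value into a 2x2 block
up = np.repeat(np.repeat(x, 2, axis=0), 2, axis=1)
print(up)
# [[1. 1. 2. 2.]
#  [1. 1. 2. 2.]
#  [3. 3. 4. 4.]
#  [3. 3. 4. 4.]]
```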
Demo 4: Convolutional Neural Networks (CNNs)
We previously discussed convolutional networks, which are well suited to processing images.
Transformers were initially developed for natural language processing (NLP) problems.
Text data have some similarities with image data.
However, text sequences can vary in length (and, unlike images, there is no easy way to resize them).
For NLP problems, we therefore need a network architecture other than fully-connected networks and convolutional networks.
Consider the task of designing a network that processes this text into a representation suitable for some downstream task (Prince 2024, 207):
The restaurant refused to serve me a ham sandwich because it only cooks vegetarian food. In the end, they just gave me two slices of bread. Their ambiance was just as good as the food and service.
Three problems:
The encoded input can be very large: representing each of the above 37 words by an embedding vector of length 1,024 leads to an input of length 37 \(\times\) 1,024 = 37,888.
Each input can be of a different length: unclear how we would apply a fully-connected network.
Language is ambiguous: to understand the text, the words it, they, and their should be connected to the word restaurant. This means that the former words should pay attention to the latter word.
Solution: A network for processing text that will
use parameter sharing to cope with long inputs of differing lengths;
contain connections between word representations.
A standard neural network layer \(\mathbf{f}[\mathbf{x}]\) takes a \(D \times 1\) input \(\mathbf{x}\), applies a linear transformation, and passes the result through an activation function, e.g., ReLU: \[ \mathbf{f}[\mathbf{x}] = \mathbf{ReLU}[\boldsymbol{\beta} + \boldsymbol{\Omega}\mathbf{x}], \] where \(\boldsymbol{\beta}\) contains the biases and \(\boldsymbol{\Omega}\) contains the weights.
A self-attention block \(\mathbf{sa}[\cdot]\) takes \(N\) input vectors \(\mathbf{x}_{1}, \dotsc, \mathbf{x}_{N}\) of dimension \(D \times 1\) each (e.g., \(N\) words) and returns \(N\) output vectors of the same size, \(D \times 1\).
In Equation 5, we saw that the value vectors \(\boldsymbol{\beta}_{v} + \boldsymbol{\Omega}_{v}\mathbf{x}_{m}\) are computed linearly for each input \(\mathbf{x}_{m}\).
As we saw in Equation 6, these value vectors are then combined linearly by the attention weights \(a[\mathbf{x}_{m}, \mathbf{x}_{n}]\).
However, the overall self-attention computation is nonlinear because the attention weights \(a[\mathbf{x}_{m}, \mathbf{x}_{n}]\) are nonlinear functions of the inputs.
To compute the attention weights, we calculate two more linear functions of the inputs: \[ \begin{split} \mathbf{q}_{n} &= \boldsymbol{\beta}_{q} + \boldsymbol{\Omega}_{q}\mathbf{x}_{n} \\ \mathbf{k}_{m} &= \boldsymbol{\beta}_{k} + \boldsymbol{\Omega}_{k}\mathbf{x}_{m}, \end{split} \] where the \(\{\mathbf{q}_{n}\}\) are called queries and the \(\{\mathbf{k}_{m}\}\) are called keys.
Next, we compute dot products (i.e., similarities) between the queries and the keys and pass the results through a softmax function to obtain the attention weight: \[ a[\mathbf{x}_{m}, \mathbf{x}_{n}] = \frac{\exp\big[\mathbf{k}_{m}^{T}\mathbf{q}_{n}\big]}{\sum_{m^{\prime} = 1}^{N}\exp\big[\mathbf{k}_{m^{\prime}}^{T}\mathbf{q}_{n}\big]}. \] For each \(\mathbf{x}_{n}\), these weights are nonnegative and sum to 1.
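Putting the pieces together, a minimal NumPy implementation of self-attention with random weights for illustration (the value, query, and key computations follow Equations 5 and 6 and the formulas above; the dimensions are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
D, N = 4, 3                              # embedding dimension, number of inputs

X = rng.standard_normal((D, N))          # columns are the inputs x_1 ... x_N

# Shared linear maps for values, queries, and keys
Omega_v, Omega_q, Omega_k = [rng.standard_normal((D, D)) for _ in range(3)]
beta_v, beta_q, beta_k = [rng.standard_normal((D, 1)) for _ in range(3)]

V = beta_v + Omega_v @ X                 # values  (Equation 5)
Q = beta_q + Omega_q @ X                 # queries q_n
K = beta_k + Omega_k @ X                 # keys    k_m

S = K.T @ Q                              # S[m, n] = k_m^T q_n
A = np.exp(S) / np.exp(S).sum(axis=0)    # softmax over m: attention weights a[x_m, x_n]
Out = V @ A                              # Out[:, n] = sum_m a[x_m, x_n] v_m (Equation 6)

print(Out.shape)                         # (4, 3): N outputs of dimension D, like the inputs
```

Note that the same matrices \(\boldsymbol{\Omega}_{v}\), \(\boldsymbol{\Omega}_{q}\), \(\boldsymbol{\Omega}_{k}\) are applied at every position, so the code works unchanged for any \(N\).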
We discussed three problems characterizing the task of designing a neural network to encode and process text: the encoded input can be very large, each input can be of a different length, and language is ambiguous (so word representations need to be connected to one another).
We also discussed that to overcome these problems, a network should use parameter sharing to cope with long inputs of differing lengths and contain connections between word representations.
The attention mechanism has both properties: the parameters \(\boldsymbol{\Omega}_{v}\), \(\boldsymbol{\beta}_{v}\), \(\boldsymbol{\Omega}_{q}\), \(\boldsymbol{\beta}_{q}\), \(\boldsymbol{\Omega}_{k}\), and \(\boldsymbol{\beta}_{k}\) are shared across all positions (so inputs of any length can be processed), and the attention weights \(a[\mathbf{x}_{m}, \mathbf{x}_{n}]\) connect every word representation to every other.
Self-attention is just one part of a transformer layer.
In a transformer layer, self attention is followed by other blocks, such as fully-connected layers.
Demo 5: Transformers